Portuguese wine

V2B Group-7

Research questions

  1. Is there a correlation between the chemical properties of wine and the perceived rating. Can we (accurately) predict the amount of points given when only taking into account the chemical properties?
  2. To what extent does the soil on which the grapes are grown have an influence on the different chemical aspects of a given wine?
  3. Can interesting correlations between different features be found using k-means clustering?

Contents:

1) Import Necessary Libraries

To start off, the necessary libraries are imported such as numpy, pandas, matplotlib, holoviews, panel and seaborn.

2) Read in and Explore the Data

Here we explore the data from the csv file

Column types:

To make this data more visible we put the column names in a csv file with the column type and data type

Description

Here we describe different column names

The categories whose actual datatype does not match our required datatypes are listed as follows:

Outliers:

The outliers are calculated as 3 times the std above the mean value from a column.
As we can see in the sample below, there are 20 red wines with a exceptionally high price.

The outliers from price below are determined by those values that are 3 times above the std of the mean of this column.

Dropping rows based on their value

Here we drop the wines from the dataframe with a higher value of chloride than legally allowed in Australia. In other countries the maximum value varies "When wine contains excess sodium (excess sodium is equal to the content of sodium ions less the content of chloride ions expressed as sodium), it is generally less than 60 mg/L, a limit which may be exceeded in exceptional cases." - https://www.oiv.int/public/medias/2604/oiv-ma-d1-03.pdf

Some observations:

3) Data Analysis/Cleaning Data

Here we look at how complete and clean the data is.
First we look at all nominal/ordinal values.

Below we take a look at all the data with discrete and continuous values.

Here we can see the alcohol feature type being string, this is however not very usable. To be able to properly use the alcohol feature we will convert it to float.

(Missing) Value correction

This section changes the value types of Alcohol, Density and Citric acid to useable and readable values.

Below we we can see that the first value is " . ".
Values like these cannot be converted to numbers, so NaN values will be used.
A lot of these last values are way too big for density values.
These should be around 1.0, so the decimal separators will be placed after the first digit.

Below we can see that the first value is " - " and we will replace this in the next part.
Also these values are strings and we convert these to floats.

Year extraction from title

Here we extract the numbers from the title and check if this number is a year, we also checked if the amount of wines with a year number in the title is usable.

4) Data visualisation

Baseline model

To create a base outcome for our first research question, we take a look at out target variable: points. We calculate the mean of all given points of the wines. This will be the most likely outcome with our simple model.

Below we see a visual representation of the outliers of the column density.
As we can see, this column has outliers on both the most upper and lower ends.

In the graph below we take a look at the amount of wines per province.
Most wines clearly originate from the first two provinces, Douro and Alentejano.

In the graph below we see the amount of wines per alcoholic percentage.
As we can see, the wines are within the range of 9 to 15 percent alcohol.

In the next graph the correlation between density and residual sugar is visualized.
There seems to be a small increase in residual sugar, the denser the wine gets, with a few outliers.

In the upcoming model, we see the correlation between the year of release and the given points.
There are a few outliers, but they don't seem to mean anything.

Just as the graph before, here we see a scatterplot between the year of release and the alcohol percentage of a wine.
There might be a slight increase in alcohol, but it seems negligable.

When we look at the right side of the graph below we can see that when the concentration chlorides rises the average amount of points falls. We can also see in this graph that there are no wines with more than 600mgs of chloride.

Below we see a histogram of the frequency of points given to a wine, which seems to have the shape of a Gaussian distribution.

5) Choosing the Best Model

Next we research if there is a correlation between the chemical propperties in a wine and the points it is given.

As seen above, there seems to be a slight increase in points given to a wine, when it is more expensive.

Now we try every possible combination of features and check wich one has the lowest RMSE value. These features will be used for the linear regression model.

Extra Dataset

To what extent does the soil on which the grapes are grown have an influence on the different chemical aspects of a given wine?

To research this factor we found a soil test dataset of portugal. In this Dataset there is a longitude and latitude for each test and the outcome (soil type) with it. To make this more useable we edit this datapack down bellow.

Now we can test if the longitude and the latitude are useable. We use the google maps api for this.

Here we can see that there is allot of info to get from these requests so now we get all the info for each test in the geo test database

The data is not verry clean so we save everything from every different ground test_location i an dataframe. So we can loop through every piece of adress and name there is and see of there is a province name in it.

Now we can make a list of the provinces with a soil test and what soil type did come most out of the test for this prvince.

Now we can filter every wine that has a province with a known soil type.

The Pairplot is to find out if one of the combinations of chemicals has a verry seperating factor on the colours of the provinces and herb the soiltypes, but this is not the case so the conclusion is that there is little effect on the wine when the soil is different. Although there are a few things that can be done better:

Clustering using k-means

Here we cluster our dataset using different discrete/continuous features as the axes.
We have tried many different combinations, of which almost all produce uninteresting clusterings.
Below is our methodology, along with a few examples of uninteresting clusterings.

Interesting clustering:
One interesting clustering we found is when using the features "Residual sugar" and "Chlorides".
This clustering seems to imply some sort of correlation, where the following statements hold.

Conclusion
  1. Is there a correlation between the chemical properties of wine and the perceived rating. Can we (accurately) predict the amount of points given when only taking into account the chemical properties?
  2. To what extent does the soil on which the grapes are grown have an influence on the different chemical aspects of a given wine?
  3. Can interesting correlations between different features be found using k-means clustering?

  1. This answer would be no because only 30.76% are predicted successfully.
  2. After multiple combinations of different variables we can conclude that there aren't any meaningful clusters to be found. The most interesting one is the plot between residual sugar and chlorides where, if the wine has a high amount of residual sugar, then there are hardly any chlorides. Conversely, if the wine has a high amount of chlorides, then there is hardly any residual sugar.

Created in deepnote.com Created in Deepnote